library(tidyverse)
library(haven)
library(Synth)
library(devtools)
library(SCtools)
library(sjlabelled)
library(rlist)
Importing the data
<- read_dta("smoking.dta") %>%
smoking as.data.frame(.)
Just to get an idea of the data, we can run a summary of smoking.dta
<- summary(smoking)
smokingsum ::kable(smokingsum) knitr
state | year | cigsale | lnincome | beer | age15to24 | retprice | |
---|---|---|---|---|---|---|---|
Min. : 1 | Min. :1970 | Min. : 40.7 | Min. : 9.397 | Min. : 2.50 | Min. :0.1294 | Min. : 27.3 | |
1st Qu.:10 | 1st Qu.:1977 | 1st Qu.:100.9 | 1st Qu.: 9.739 | 1st Qu.:20.90 | 1st Qu.:0.1658 | 1st Qu.: 50.0 | |
Median :20 | Median :1985 | Median :116.3 | Median : 9.861 | Median :23.30 | Median :0.1781 | Median : 95.5 | |
Mean :20 | Mean :1985 | Mean :118.9 | Mean : 9.862 | Mean :23.43 | Mean :0.1755 | Mean :108.3 | |
3rd Qu.:30 | 3rd Qu.:1993 | 3rd Qu.:130.5 | 3rd Qu.: 9.973 | 3rd Qu.:25.10 | 3rd Qu.:0.1867 | 3rd Qu.:158.4 | |
Max. :39 | Max. :2000 | Max. :296.2 | Max. :10.487 | Max. :40.40 | Max. :0.2037 | Max. :351.2 | |
NA | NA | NA | NA’s :195 | NA’s :663 | NA’s :390 | NA |
To prepare our data, we need to 1) create a new column, statelabels, where the state names are listed corresponding to their state number; 2) remove the labels from columns statelabels and state; 3) lastly, to ensure compatibility (with the function dataprep() from Synth package), set state class as numeric and statelabel class as character.
#1
$statelabels <- as_label(smoking$state)
smoking
#2
remove_label(smoking, statelabels)
remove_label(smoking, state)
#3
$statelabels <- as.character(smoking$statelabels)
smoking$state <- as.numeric(smoking$state) smoking
The data frame now is prepped and looks like the following:
smoking
Here we set our indicator variables with: lnincome, beer, age15to24, and retprice. Additionally, we create lagged indicator variables from cigarette pack prices for the following years: 1988, 1980, and 1975. The dependent variable is cigsale and the treatment year is 1988.
<- dataprep(
dataprep_out foo = smoking,
predictors = c("lnincome", "beer", "age15to24", "retprice"),
predictors.op = "mean",
time.predictors.prior = 1970:1988,
special.predictors = list(
list("cigsale", 1988, "mean"),
list("cigsale", 1980, "mean"),
list("cigsale", 1975, "mean")),
dependent = "cigsale",
unit.variable = "state",
unit.names.variable = "statelabels",
time.variable = "year",
treatment.identifier = 3,
controls.identifier = c(1,2,4:39),
time.optimize.ssr = 1970:1988,
time.plot = 1970:2000
)<- synth(data.prep.obj = dataprep_out) synth_out
Running the following two lines of code gives us the following two graphs:
path.plot(synth_out, dataprep_out, Ylab = c("per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("California v Synthetic California"), tr.intake = 1988, Legend = c("Calfornia", "Synthetic California"))
gaps.plot(synth_out, dataprep_out, Ylab = c("gap in per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("Gap in Per-capital Cigarette Sales Between California and Synthetic"), tr.intake = 1988)
The fit and the gap look identical to the ones generated by Adbodie et al.
To generate the table to look our what donors contribute to our synthetic California, we can run the following code:
<- synth.tab(synth_out, dataprep_out, round.digit = 3)
synth_table ::kable(
knitrlist(synth_table$tab.pred, synth_table$tab.v),
booktabs = TRUE, valign = 't'
)
|
|
California and synthetic California means match up very well together. The same could be said about the sample mean, except for the lagged cigarette sales. We see that the synthetic control method put heavy weights on the lagged cigarette sales for years 1975 and 1980.
::kable(synth_table$tab.w) knitr
w.weights | unit.names | unit.numbers | |
---|---|---|---|
1 | 0.000 | Alabama | 1 |
2 | 0.000 | Arkansas | 2 |
4 | 0.094 | Colorado | 4 |
5 | 0.111 | Connecticut | 5 |
6 | 0.000 | Delaware | 6 |
7 | 0.000 | Georgia | 7 |
8 | 0.000 | Idaho | 8 |
9 | 0.000 | Illinois | 9 |
10 | 0.000 | Indiana | 10 |
11 | 0.000 | Iowa | 11 |
12 | 0.000 | Kansas | 12 |
13 | 0.000 | Kentucky | 13 |
14 | 0.000 | Louisiana | 14 |
15 | 0.000 | Maine | 15 |
16 | 0.000 | Minnesota | 16 |
17 | 0.000 | Mississippi | 17 |
18 | 0.000 | Missouri | 18 |
19 | 0.205 | Montana | 19 |
20 | 0.000 | Nebraska | 20 |
21 | 0.249 | Nevada | 21 |
22 | 0.000 | New Hampshire | 22 |
23 | 0.001 | New Mexico | 23 |
24 | 0.000 | North Carolina | 24 |
25 | 0.000 | North Dakota | 25 |
26 | 0.000 | Ohio | 26 |
27 | 0.000 | Oklahoma | 27 |
28 | 0.000 | Pennsylvania | 28 |
29 | 0.000 | Rhode Island | 29 |
30 | 0.000 | South Carolina | 30 |
31 | 0.000 | South Dakota | 31 |
32 | 0.000 | Tennessee | 32 |
33 | 0.000 | Texas | 33 |
34 | 0.341 | Utah | 34 |
35 | 0.000 | Vermont | 35 |
36 | 0.000 | Virginia | 36 |
37 | 0.000 | West Virginia | 37 |
38 | 0.000 | Wisconsin | 38 |
39 | 0.000 | Wyoming | 39 |
We see that our synthetic California is made up of 0.341 Utah, 0.249 Nevada, 0.205 Montana, 0.111 Connecticut, 0.094 Colorado, and 0.001 New Mexico. This composition and the predictor means are extremely close to the ones obtained by Abodie et al.
Now we can generate our placebos and the corresponding p-values:
<- generate.placebos(dataprep_out, synth_out, Sigf.ipop = 4, strategy = "multiprocess") placebos
plot_placebos(placebos)
mspe.plot(placebos, discard.extreme = FALSE, mspe.limit = 1, plot.hist = TRUE)
# calculate and print p-value
<- mspe.test(placebos, discard.extreme = F)
ratio $p.val ratio
## [1] 0.02564103
California has the highest post/pre MSPE ratio and the The probability of obtaining a post/pre MSPE ratio as high as California is 1/39, or 0.02564, which is the same result as obtained by Abodie et al. The average treatment effect from Proposition 99 is estimated to be around a decrease of 24 packs per year from year 1988 to 2000.
To drop Utah from the set of controls, we can just rerun the code from part 2, dropping 34 (Utah’s numeric label) from our donor pool (controls.identifier), and appending NU (no Utah) to our renamed dataframes (dataprep_out_NU and synth_out_NU).
<- dataprep(
dataprep_out_NU foo = smoking,
predictors = c("lnincome", "beer", "age15to24", "retprice"),
predictors.op = "mean",
time.predictors.prior = 1970:1988,
special.predictors = list(
list("cigsale", 1988, "mean"),
list("cigsale", 1980, "mean"),
list("cigsale", 1975, "mean")),
# list("retprice", 1970:1988, "mean")),
dependent = "cigsale",
unit.variable = "state",
unit.names.variable = "statelabels",
time.variable = "year",
treatment.identifier = 3,
controls.identifier = c(1,2,4:33,35:39),
time.optimize.ssr = 1970:1988,
time.plot = 1970:2000
)<- synth(data.prep.obj = dataprep_out_NU) synth_out_NU
path.plot(synth_out_NU, dataprep_out_NU, Ylab = c("per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("California v Synthetic California"), tr.intake = 1988, Legend = c("Calfornia", "Synthetic California"))
gaps.plot(synth_out_NU, dataprep_out_NU, Ylab = c("gap in per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("Gap in Per-capital Cigarette Sales Between California and Synthetic"), tr.intake = 1988 )
Our fit and gap look pretty good here. The average treatment effect looks similar to the result above at around a decrease of 20 packs per year.
<- synth.tab(synth_out_NU, dataprep_out_NU, round.digit = 3)
synth_table_NU ::kable(
knitrlist(synth_table_NU$tab.pred, synth_table_NU$tab.v),
booktabs = TRUE, valign = 't'
)
|
|
Our indicator means, like in the first part, matches quite well, except for beer. Dropping Utah results in a much heavier weighting towards lagged cigarette sale in 1975.
::kable(synth_table_NU$tab.w) knitr
w.weights | unit.names | unit.numbers | |
---|---|---|---|
1 | 0.000 | Alabama | 1 |
2 | 0.000 | Arkansas | 2 |
4 | 0.139 | Colorado | 4 |
5 | 0.069 | Connecticut | 5 |
6 | 0.000 | Delaware | 6 |
7 | 0.000 | Georgia | 7 |
8 | 0.048 | Idaho | 8 |
9 | 0.000 | Illinois | 9 |
10 | 0.000 | Indiana | 10 |
11 | 0.000 | Iowa | 11 |
12 | 0.000 | Kansas | 12 |
13 | 0.000 | Kentucky | 13 |
14 | 0.000 | Louisiana | 14 |
15 | 0.000 | Maine | 15 |
16 | 0.000 | Minnesota | 16 |
17 | 0.000 | Mississippi | 17 |
18 | 0.000 | Missouri | 18 |
19 | 0.000 | Montana | 19 |
20 | 0.000 | Nebraska | 20 |
21 | 0.179 | Nevada | 21 |
22 | 0.000 | New Hampshire | 22 |
23 | 0.563 | New Mexico | 23 |
24 | 0.000 | North Carolina | 24 |
25 | 0.000 | North Dakota | 25 |
26 | 0.000 | Ohio | 26 |
27 | 0.000 | Oklahoma | 27 |
28 | 0.000 | Pennsylvania | 28 |
29 | 0.000 | Rhode Island | 29 |
30 | 0.000 | South Carolina | 30 |
31 | 0.000 | South Dakota | 31 |
32 | 0.000 | Tennessee | 32 |
33 | 0.000 | Texas | 33 |
35 | 0.000 | Vermont | 35 |
36 | 0.000 | Virginia | 36 |
37 | 0.000 | West Virginia | 37 |
38 | 0.000 | Wisconsin | 38 |
39 | 0.000 | Wyoming | 39 |
In this instance, after dropping Utah, synthetic California becomes 0.564 New Mexico, 0.179 Nevada, 0.130 Colorado, 0.069 Connecticut, and 0.048 Idaho.
<- generate.placebos(dataprep_out_NU, synth_out_NU, Sigf.ipop = 4, strategy = "multiprocess") placebos_NU
plot_placebos(placebos_NU)
mspe.plot(placebos_NU, discard.extreme = FALSE, mspe.limit = 1, plot.hist = TRUE)
# calculate and print p-value
<- mspe.test(placebos_NU, discard.extreme = F)
ratio_NU $p.val ratio_NU
## [1] 0.1052632
After dropping Utah, California no longer has the highest post/pre MSPE ratio. The probability of obtaining a post/pre MSPE ratio as high as California is 4/39 or 0.10256. California’s MSPE ratio is also not as high as in part 2. Although the pre-treatment fit is still good, the pseudo p-value does not indicate significance.
Synthesizing our California with Utah and without the lagged cigarette pack sales, we can take out the special indicators and put 34 back into our donor pool with the following code, appending NS (no special) to our data frames:
<- dataprep(
dataprep_out_NS foo = smoking,
predictors = c("lnincome", "beer", "age15to24", "retprice"),
predictors.op = "mean",
time.predictors.prior = 1970:1988,
dependent = "cigsale",
unit.variable = "state",
unit.names.variable = "statelabels",
time.variable = "year",
treatment.identifier = 3,
controls.identifier = c(1,2,4:39),
time.optimize.ssr = 1970:1988,
time.plot = 1970:2000
)<- synth(data.prep.obj = dataprep_out_NS) synth_out_NS
path.plot(synth_out_NS, dataprep_out_NS, Ylab = c("per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("California v Synthetic California"), tr.intake = 1988, Legend = c("Calfornia", "Synthetic California"))
gaps.plot(synth_out_NS, dataprep_out_NS, Ylab = c("gap in per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("Gap in Per-capital Cigarette Sales Between California and Synthetic"), tr.intake = 1988 )
Without the lagged cigarette pack sales, the pre-treatment fit between California and the synthetic California is worse and consequently, the gap between the two is also worse. Our estimated treated effect looks to be a larger decrease, near 30 packs per year, however, the years leading up to the treatment year are badly fitted.
<- synth.tab(synth_out_NS, dataprep_out_NS, round.digit = 3)
synth_table_NS ::kable(
knitrlist(synth_table_NS$tab.pred, synth_table_NS$tab.v),
booktabs = TRUE, valign = 't'
)
|
|
The means look pretty much the same, except beer and retprice are a little off. However, they are still more closely matched with California compared to the sample. In this case, a large proportion of the variable weight is on lnincome.
::kable(synth_table_NS$tab.w) knitr
w.weights | unit.names | unit.numbers | |
---|---|---|---|
1 | 0.000 | Alabama | 1 |
2 | 0.000 | Arkansas | 2 |
4 | 0.514 | Colorado | 4 |
5 | 0.477 | Connecticut | 5 |
6 | 0.000 | Delaware | 6 |
7 | 0.000 | Georgia | 7 |
8 | 0.000 | Idaho | 8 |
9 | 0.000 | Illinois | 9 |
10 | 0.000 | Indiana | 10 |
11 | 0.000 | Iowa | 11 |
12 | 0.000 | Kansas | 12 |
13 | 0.000 | Kentucky | 13 |
14 | 0.000 | Louisiana | 14 |
15 | 0.000 | Maine | 15 |
16 | 0.000 | Minnesota | 16 |
17 | 0.000 | Mississippi | 17 |
18 | 0.000 | Missouri | 18 |
19 | 0.000 | Montana | 19 |
20 | 0.000 | Nebraska | 20 |
21 | 0.000 | Nevada | 21 |
22 | 0.000 | New Hampshire | 22 |
23 | 0.000 | New Mexico | 23 |
24 | 0.000 | North Carolina | 24 |
25 | 0.000 | North Dakota | 25 |
26 | 0.000 | Ohio | 26 |
27 | 0.000 | Oklahoma | 27 |
28 | 0.000 | Pennsylvania | 28 |
29 | 0.000 | Rhode Island | 29 |
30 | 0.000 | South Carolina | 30 |
31 | 0.000 | South Dakota | 31 |
32 | 0.000 | Tennessee | 32 |
33 | 0.000 | Texas | 33 |
34 | 0.000 | Utah | 34 |
35 | 0.000 | Vermont | 35 |
36 | 0.001 | Virginia | 36 |
37 | 0.000 | West Virginia | 37 |
38 | 0.000 | Wisconsin | 38 |
39 | 0.000 | Wyoming | 39 |
We see that in this case, our synthetic California is comprised of 0.514 Colorado, 0.477 Connecticut, and 0.001 Virginia. This is a much more different composition and weighting compared to the previous two.
<- generate.placebos(dataprep_out_NS, synth_out_NS, Sigf.ipop = 4, strategy = "multiprocess") placebos_NS
plot_placebos(placebos_NS)
mspe.plot(placebos_NS, discard.extreme = FALSE, mspe.limit = 1, plot.hist = TRUE)
# calculate and print p-value
<- mspe.test(placebos_NS, discard.extreme = F)
ratio_NS $p.val ratio_NS
## [1] 0.1025641
After removing the lagged cigarette pack sales, California, like in part 3, no longer has the highest post/pre MSPE ratio. The probability of obtaining a post/pre MSPE ratio as high as California is the same as in part 3: 4/39 or 0.10256. The MSPE ratio is also less than compared to part 3.
<- dataprep(
dataprep_out_81 foo = smoking,
predictors = c("lnincome", "age15to24", "retprice"),
predictors.op = "mean",
time.predictors.prior = 1970:1981,
special.predictors = list(
list("cigsale", 1981, "mean"),
list("cigsale", 1980, "mean"),
list("cigsale", 1975, "mean")),
dependent = "cigsale",
unit.variable = "state",
unit.names.variable = "statelabels",
time.variable = "year",
treatment.identifier = 3,
controls.identifier = c(1,2,4:39),
time.optimize.ssr = 1970:1981,
time.plot = 1970:1988
)<- synth(data.prep.obj = dataprep_out_81) synth_out_81
I received an error stating that the indicator variable “beer” is missing data for all time periods before the treatment at 1981. Therefore, in order to run the code, I had to drop beer from the predictors. I also changed the lagged cigarette pack sales from year 1988 to 1981. Everything else remains unchanged.
path.plot(synth_out_81, dataprep_out_81, Ylab = c("per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("California v Synthetic California"), tr.intake = 1981, Legend = c("Calfornia", "Synthetic California"))
gaps.plot(synth_out_81, dataprep_out_81, Ylab = c("gap in per-capita cigarette sales (in packs)"), Xlab = c("Year"), Main = c("Gap in Per-capital Cigarette Sales Between California and Synthetic"), tr.intake = 1981)
<- synth.tab(synth_out_81, dataprep_out_81, round.digit = 3)
synth_table_81 ::kable(
knitrlist(synth_table_81$tab.pred, synth_table_81$tab.v),
booktabs = TRUE, valign = 't'
)
|
|
Our means here fit extremely well and the lagged cigarette sales fit much better than the sample mean. Like in the first case, the variable weighting is on mostly lagged cigarette sales on 1975 and 1980.
::kable(synth_table_81$tab.w) knitr
w.weights | unit.names | unit.numbers | |
---|---|---|---|
1 | 0.000 | Alabama | 1 |
2 | 0.000 | Arkansas | 2 |
4 | 0.001 | Colorado | 4 |
5 | 0.331 | Connecticut | 5 |
6 | 0.000 | Delaware | 6 |
7 | 0.000 | Georgia | 7 |
8 | 0.000 | Idaho | 8 |
9 | 0.001 | Illinois | 9 |
10 | 0.000 | Indiana | 10 |
11 | 0.000 | Iowa | 11 |
12 | 0.000 | Kansas | 12 |
13 | 0.000 | Kentucky | 13 |
14 | 0.000 | Louisiana | 14 |
15 | 0.000 | Maine | 15 |
16 | 0.000 | Minnesota | 16 |
17 | 0.000 | Mississippi | 17 |
18 | 0.000 | Missouri | 18 |
19 | 0.000 | Montana | 19 |
20 | 0.000 | Nebraska | 20 |
21 | 0.305 | Nevada | 21 |
22 | 0.000 | New Hampshire | 22 |
23 | 0.000 | New Mexico | 23 |
24 | 0.000 | North Carolina | 24 |
25 | 0.000 | North Dakota | 25 |
26 | 0.000 | Ohio | 26 |
27 | 0.000 | Oklahoma | 27 |
28 | 0.000 | Pennsylvania | 28 |
29 | 0.000 | Rhode Island | 29 |
30 | 0.000 | South Carolina | 30 |
31 | 0.000 | South Dakota | 31 |
32 | 0.000 | Tennessee | 32 |
33 | 0.000 | Texas | 33 |
34 | 0.361 | Utah | 34 |
35 | 0.000 | Vermont | 35 |
36 | 0.000 | Virginia | 36 |
37 | 0.000 | West Virginia | 37 |
38 | 0.000 | Wisconsin | 38 |
39 | 0.000 | Wyoming | 39 |
In this case, the synthetic California is made up of 0.361 Utah, 0.305 Nevada, 0.331 Connecticut, and 0.001 Illinois. The indicator variables match up extremely well and the pre-treatment fit looks almost perfect. However, the real California and synthetic California continue to be well fitted a few years past the placebo treatment date.
<- generate.placebos(dataprep_out_81, synth_out_81, Sigf.ipop = 4, strategy = "multiprocess") placebos_81
plot_placebos(placebos_81)
mspe.plot(placebos_81, discard.extreme = FALSE, mspe.limit = 1, plot.hist = TRUE)
# calculate and print p-value
<- mspe.test(placebos_81, discard.extreme = F)
ratio_81 $p.val ratio_81
## [1] 0.1538462
In this case, the probability of obtaining a post/pre MSPE ratio as large as California’s is 6/39 = 0.153846. The average treatment effect in this in-time placebo case would be about a 8-10 pack decrease per year from year 1981 to 1988, although the pseudo p-value is not significant. This result is an even lower estimated effect compared to the one study by Fichtenberg and Glantz (an estimated decrease of 14 packs per year) that Abodie et al. mentions.
I found and graphed the post and pre MSPE for California and the control units and the post/pre MSPE ratio at the end of each parts 2-5. Summarizing the pseudo p-values and the ATET here:
Normal | No Utah | No Lagged | In-time Placebo | |
---|---|---|---|---|
Pseudo p-values | 0.02564 | 0.10256 | 0.10256 | 0.15385 |
ATET (packs per year) | ~-24 | ~-20 | ~-30 | ~-9 |
We see that only the pseudo p-values from the first synthetic control is significant.